Deep Multi-task Multi-label CNN for Effective Facial Attribute Classification
Facial Attribute Classification (FAC) has attracted increasing attention in
computer vision and pattern recognition. However, state-of-the-art FAC methods
perform face detection/alignment and FAC independently. The inherent
dependencies between these tasks are not fully exploited. In addition, most
methods predict all facial attributes with the same CNN architecture,
which ignores the different learning complexities of facial attributes. To
address the above problems, we propose a novel deep multi-task multi-label CNN,
termed DMM-CNN, for effective FAC. Specifically, DMM-CNN jointly optimizes two
closely-related tasks (i.e., facial landmark detection and FAC) to improve the
performance of FAC by taking advantage of multi-task learning. To deal with the
diverse learning complexities of facial attributes, we divide the attributes
into two groups: objective attributes and subjective attributes. Two different
network architectures are designed to extract features for the two groups of
attributes, respectively, and a novel dynamic weighting scheme is proposed to
automatically assign a loss weight to each facial attribute during training.
Furthermore, an adaptive thresholding strategy is developed to effectively
alleviate the problem of class imbalance for multi-label learning. Experimental
results on the challenging CelebA and LFWA datasets show the superiority of the
proposed DMM-CNN method over several state-of-the-art FAC methods.
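The dynamic weighting scheme lends itself to a short illustration. Below is a minimal PyTorch sketch of the general idea: per-attribute binary cross-entropy losses are reweighted by a normalized running average, so attributes that are currently harder to learn receive larger weights. The class name, the moving-average update rule, and the momentum value are assumptions for illustration, not the exact scheme of DMM-CNN.

```python
import torch
import torch.nn as nn

class DynamicWeightedAttributeLoss(nn.Module):
    """Hypothetical sketch: reweight per-attribute BCE losses so that
    attributes with higher recent loss get larger weights. The exact
    update rule in DMM-CNN may differ."""

    def __init__(self, num_attrs: int, momentum: float = 0.9):
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss(reduction="none")
        self.momentum = momentum
        # running estimate of each attribute's difficulty
        self.register_buffer("running_loss", torch.ones(num_attrs))

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # logits, targets: (batch, num_attrs), one binary label per attribute
        per_attr = self.bce(logits, targets).mean(dim=0)  # (num_attrs,)
        with torch.no_grad():
            self.running_loss.mul_(self.momentum).add_((1 - self.momentum) * per_attr)
            weights = self.running_loss / self.running_loss.mean()
        return (weights * per_attr).mean()
```

In the multi-task setting the abstract describes, this attribute loss would be summed with a facial landmark regression loss, e.g. `total = attr_loss + lam * landmark_loss`.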
Learning Speech Representation From Contrastive Token-Acoustic Pretraining
For fine-grained generation and recognition tasks such as
minimally-supervised text-to-speech (TTS), voice conversion (VC), and automatic
speech recognition (ASR), the intermediate representations extracted from
speech should serve as a "bridge" between text and acoustic information,
containing information from both modalities. The semantic content is
emphasized, while the paralinguistic information such as speaker identity and
acoustic details should be de-emphasized. However, existing methods for
extracting fine-grained intermediate representations from speech suffer from
issues of excessive redundancy and dimension explosion. Contrastive learning is
well suited to modeling intermediate representations across two modalities.
However, existing contrastive learning methods in the audio field focus on
extracting global descriptive information for downstream audio classification
tasks, making them unsuitable for TTS, VC, and ASR tasks. To address these
issues, we propose a method named "Contrastive Token-Acoustic Pretraining
(CTAP)", which uses two encoders to bring phoneme and speech into a joint
multimodal space, learning how to connect phoneme and speech at the frame
level. The CTAP model is trained on 210k speech and phoneme text pairs,
achieving minimally-supervised TTS, VC, and ASR. The proposed CTAP method
offers a promising solution for fine-grained generation and recognition
downstream tasks in speech processing.
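A frame-level contrastive objective of the kind described here can be sketched compactly. The following is an assumed PyTorch version in the spirit of CTAP: time-aligned phoneme and speech frame embeddings from the two encoders are treated as positive pairs, all other frames in the batch as negatives, under a symmetric InfoNCE loss. The shapes, the prior frame alignment, and the temperature value are illustrative, not CTAP's exact formulation.

```python
import torch
import torch.nn.functional as F

def frame_contrastive_loss(phone_emb, speech_emb, temperature=0.07):
    # phone_emb, speech_emb: (num_frames, dim), already time-aligned so that
    # row i of each tensor refers to the same frame (an assumption here)
    phone_emb = F.normalize(phone_emb, dim=-1)
    speech_emb = F.normalize(speech_emb, dim=-1)
    logits = phone_emb @ speech_emb.t() / temperature  # pairwise similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    # symmetric InfoNCE: phoneme-to-speech and speech-to-phoneme directions
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))
```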
Speech and Noise Dual-Stream Spectrogram Refine Network with Speech Distortion Loss for Robust Speech Recognition
In recent years, the joint training of speech enhancement front-end and
automatic speech recognition (ASR) back-end has been widely used to improve the
robustness of ASR systems. Traditional joint training methods use only the
enhanced speech as input to the back-end. However, it is difficult for speech
enhancement systems to separate speech directly from the noisy input due to the diverse
types of noise with different intensities. Furthermore, speech distortion and
residual noise are often observed in enhanced speech, and the distortion of
speech and noise is different. Most existing methods focus on fusing enhanced
and noisy features to address this issue. In this paper, we propose a
dual-stream spectrogram refine network to simultaneously refine the speech and
noise and to decouple the noise from the noisy input. Our proposed method
achieves better performance, with a relative character error rate (CER) reduction of 8.6%.
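To make the dual-stream idea concrete, here is a minimal PyTorch sketch, assuming a shared encoder with two masking heads that refine the speech and noise spectrograms separately, with the noise stream supervised by the residual of the mixture. The architecture (a BiGRU) and layer sizes are placeholders, not the paper's network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStreamRefiner(nn.Module):
    """Hypothetical sketch: one shared encoder, two heads that estimate
    masks for the speech and noise spectrograms from the noisy input."""

    def __init__(self, n_bins=257, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(n_bins, hidden, batch_first=True, bidirectional=True)
        self.speech_head = nn.Linear(2 * hidden, n_bins)
        self.noise_head = nn.Linear(2 * hidden, n_bins)

    def forward(self, noisy):  # noisy: (batch, frames, n_bins) magnitudes
        h, _ = self.encoder(noisy)
        speech = torch.sigmoid(self.speech_head(h)) * noisy  # speech stream
        noise = torch.sigmoid(self.noise_head(h)) * noisy    # noise stream
        return speech, noise

def dual_stream_loss(speech, noise, clean, noisy):
    # supervise both streams; the noise target is the mixture residual
    return F.mse_loss(speech, clean) + F.mse_loss(noise, noisy - clean)
```

Estimating the noise stream explicitly, rather than the speech alone, is what exposes to the back-end how speech and noise are distorted differently.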
Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding
Recently, there has been a growing interest in text-to-speech (TTS) methods
that can be trained with minimal supervision by combining two types of discrete
speech representations and using two sequence-to-sequence tasks to decouple
TTS. To address the challenges associated with high dimensionality and waveform
distortion in discrete representations, we propose Diff-LM-Speech, which maps
semantic embeddings to mel-spectrograms with a diffusion model and introduces
a prompt encoder, built on variational autoencoders and prosody bottlenecks,
to improve prompt representation capability.
Autoregressive language models often suffer from missing and repeated words,
while non-autoregressive frameworks face expression averaging problems due to
duration prediction models. To address these issues, we propose
Tetra-Diff-Speech, which introduces a duration diffusion model to achieve diverse
prosodic expressions. While we expect the information content of semantic
coding to lie between that of text and acoustic coding, existing models
extract semantic codes with substantial redundancy and suffer from
dimensionality explosion. To verify that semantic coding is not necessary, we propose
Tri-Diff-Speech. Experimental results show that our proposed methods outperform
baseline methods. We provide a website with audio samples.
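The semantic-to-mel diffusion stage shared by these models can be illustrated with a standard DDPM-style training step, with the denoiser conditioned on semantic embeddings. This is a generic sketch under assumed interfaces (the `denoiser` signature, the linear beta schedule, the tensor shapes), not the papers' exact model.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, mel, semantic, num_steps=1000):
    # mel: (batch, frames, n_mels); semantic: (batch, frames, dim)
    betas = torch.linspace(1e-4, 0.02, num_steps, device=mel.device)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, num_steps, (mel.size(0),), device=mel.device)
    a_bar = alphas_bar[t].view(-1, 1, 1)
    noise = torch.randn_like(mel)
    noisy_mel = a_bar.sqrt() * mel + (1.0 - a_bar).sqrt() * noise
    # the denoiser predicts the injected noise, conditioned on semantics
    pred = denoiser(noisy_mel, t, semantic)
    return F.mse_loss(pred, noise)
```

A duration diffusion model of the kind Tetra-Diff-Speech describes would apply the same recipe with per-phoneme durations in place of mel frames.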